Abstract
Sickle cell disease (SCD) is an inherited red cell disorder that arises from a single nucleotide mutation in the beta hemoglobin gene, yet its effects are far-reaching and multifaceted. Despite its monogenic origin, the disease leads to a cascade of complications due to the altered shape and rigidity of red blood cells, which can cause episodes of severe pain, organ damage, and an increased risk of infection. The variability in clinical presentation and outcome among individuals reflects the complex interplay of genetic, environmental, and lifestyle factors that influence disease severity, manifestations, and long-term survival. Clinical risk factors and biomarkers have been studied extensively and are used as decision-making tools, but genetic variants co-inherited with the causal hemoglobin mutation also influence disease progression and mortality. Identifying prognostic variants is challenging, however, because of the complexity of the genome and the subtle effects of individual markers. Here, we used a combination of machine learning and survival analysis methods to identify and validate genetic variants that can help predict long-term mortality risk in patients with SCD.
Methods: We studied a cohort of 673 patients with SCD (554 HbSS and HbSβ⁰, 91 HbSC, 25 HbSβ⁺, 3 HbSD and HbSO) enrolled at the National Heart, Lung, and Blood Institute (NCT00011648) from 2006 to 2025. Using whole genome sequencing, we identified 75 genetic variants across 50 genes reported in the literature to be linked to sickle-related complications. A random survival forest ranked the variants by their importance in predicting survival. The top 10 variants were then analyzed using Kaplan-Meier methods, and the five variants that were significant were modeled in more detail: rs1427407 (BCL11A), a compound APOL1 variant (rs73885319 + rs71785313), rs2235302 (SELP), rs7412 (APOE), and rs76992529 (TTR, V122I).
Based on their ranking, patients were then classified into four mutually exclusive genetic risk groups:
Group 1 (n=58): Carriers of rs1427407 (BCL11A) or the APOL1 compound variant only;
Group 2 (reference group, n=351): Non-carriers of all five variants;
Group 3 (n=240): Carriers of rs2235302 (SELP) or rs7412 (APOE) only;
Group 4 (n=24): Carriers of rs76992529 (TTR, V122I) only.
A Cox proportional hazards model assessed survival differences using Group 2 as the reference. Kaplan-Meier curves and 10-year survival estimates were generated, and model performance was evaluated using the concordance index (C-index). The same analyses were repeated in the subgroup of patients with severe genotypes (HbSS or HbSβ⁰; n=554).
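As an illustration of the two evaluation tools named above, the following is a minimal sketch of a Kaplan-Meier estimator and Harrell's C-index written from their textbook definitions. It is not the study's code: the function names and the six-patient toy data are hypothetical, and the risk scores are arbitrary numbers standing in for the output of a fitted model.

```python
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time.

    events[i] is 1 if the death was observed and 0 if the patient was censored.
    """
    curve = []
    survival = 1.0
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


def survival_at(curve: List[Tuple[float, float]], horizon: float) -> float:
    """Step-function lookup of the survival estimate at a given horizon."""
    estimate = 1.0
    for t, value in curve:
        if t > horizon:
            break
        estimate = value
    return estimate


def harrell_c_index(times: List[float], events: List[int],
                    risk_scores: List[float]) -> float:
    """Fraction of comparable pairs that the risk score orders correctly.

    A pair is comparable when the patient with the shorter follow-up had an
    observed event. Tied risk scores count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        for j in range(len(times)):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical toy data: follow-up in years, 1 = death observed, 0 = censored.
toy_times = [2, 4, 4, 6, 8, 12]
toy_events = [1, 1, 0, 1, 0, 0]
toy_risk = [3.0, 2.0, 1.0, 2.0, 1.0, 0.0]  # higher = predicted higher risk

curve = kaplan_meier(toy_times, toy_events)
ten_year_survival = survival_at(curve, 10)
c_index = harrell_c_index(toy_times, toy_events, toy_risk)
print(f"10-year survival: {ten_year_survival:.3f}, C-index: {c_index:.3f}")
```

A C-index of 0.5 corresponds to random ordering and 1.0 to perfect ordering of patients by risk.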
Results: Survival differed significantly across the four genetic groups (log-rank p < 0.0001). Compared to non-carriers (Group 2), Group 1 had a 57% lower risk of death (HR = 0.43, 95% CI: 0.23–0.80, p = 0.0075). Group 3 had a 57% higher risk of death (HR = 1.57, 95% CI: 1.20–2.04, p = 0.0008), while Group 4 had the highest risk, approximately threefold that of the reference group (HR = 3.05, 95% CI: 1.68–5.55, p = 0.00025). In Kaplan-Meier analysis, 10-year survival rates were 82.7% for Group 1, 64.7% for Group 2, 47.2% for Group 3, and 27.7% for Group 4, indicating significantly better survival in Group 1 and much earlier mortality in Group 4. The model's C-index was 0.601, indicating modest predictive ability. Although not highly discriminative, the model captured a meaningful risk stratification. Similar patterns were observed in the subgroup of patients with severe genotypes (HbSS or HbSβ⁰), supporting the robustness of these genetic risk classifications.
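The arithmetic behind the reported hazard ratios can be checked with a short sketch. It converts each HR into a percent change in hazard relative to Group 2 and recovers an approximate two-sided p-value from the 95% confidence interval, assuming a Wald interval that is symmetric on the log scale. That assumption, and the function names, are illustrative and not taken from the study; because the inputs are rounded, the recovered p-values only approximate the reported ones.

```python
import math

# Hazard ratios and 95% confidence limits as reported in the Results
# (hazard ratio, lower limit, upper limit), each versus Group 2.
reported = {
    "Group 1": (0.43, 0.23, 0.80),
    "Group 3": (1.57, 1.20, 2.04),
    "Group 4": (3.05, 1.68, 5.55),
}


def percent_change(hr: float) -> float:
    """Relative change in hazard versus the reference group, in percent."""
    return (hr - 1.0) * 100.0


def wald_p_value(hr: float, lower: float, upper: float,
                 z_crit: float = 1.959964) -> float:
    """Approximate two-sided p-value from an HR and its 95% CI.

    Assumes the interval is log(HR) +/- z_crit * SE (a Wald interval).
    """
    se = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = math.log(hr) / se
    # erfc(|z| / sqrt(2)) equals the two-sided standard normal tail area.
    return math.erfc(abs(z) / math.sqrt(2.0))


summary = {
    group: (percent_change(hr), wald_p_value(hr, lower, upper))
    for group, (hr, lower, upper) in reported.items()
}
for group, (change, p_value) in summary.items():
    print(f"{group}: {change:+.0f}% hazard vs. Group 2, approx. p = {p_value:.5f}")
```

On this scale, HR = 0.43 corresponds to a 57% lower hazard and HR = 1.57 to a 57% higher hazard, which matches the wording of the Results.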
Conclusion: This study demonstrates that combining machine learning with survival analysis can identify genetic variants associated with SCD mortality. Using random survival forest ranking and Kaplan-Meier analysis, we identified five variants that stratify patients into four distinct genetic risk groups with different survival outcomes. The findings suggest a protective effect of rs1427407 in BCL11A and the APOL1 compound variant (rs73885319 and rs71785313). The TTR variant was associated with the poorest outcomes, and rs2235302 in SELP or rs7412 in APOE was associated with a modest increase in mortality risk. These stratification patterns highlight the potential of genetic profiling to improve personalized risk prediction and guide long-term management in patients with SCD.